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Abstract 

In model-based clustering and classification, the cluster-weighted model constitutes 
a convenient approach when the random vector of interest constitutes a response 
variable Y and a set p of explanatory variables X. However, its applicability may be 
limited when p is high. To overcome this problem, this paper assumes a latent factor 
structure for X in each mixture component. This leads to the cluster- weighted fac- 
tor analyzers (CWFA) model. By imposing constraints on the variance of Y and the 
covariance matrix of X, a novel family of sixteen CWFA models is introduced for 
model-based clustering and classification. The alternating expectation-conditional 
maximization algorithm, for maximum likelihood estimation of the parameters of 
all the models in the family, is described; to initialize the algorithm, a 5-step hierar- 
chical procedure is proposed, which uses the nested structures of the models within 
the family and thus guarantees the natural ranking among the sixteen likelihoods. 
Artificial and real data show that these models have very good clustering and clas- 
sification performance and that the algorithm is able to recover the parameters very 
well. 
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1 Introduction 



In direct applications of finite mixture models, each of the G mixture com- 
ponents is taken to repr esent a sub-group (or cluster) within the data (see 



Titterington et all Il985l . pp. 2-3). The terms 'model-based clustering' and 



'model-based classification' have been used to describe the adoption of mix- 
ture models or, more often, a family of mixture models for clustering and 

classification, respectively. In the 1990's, three model-b ased clustering papers 

(IBanfield and Rafteryl . ll993l . lCeleux and Govaertl . ll995l . and lGhahramani and Hinton 
19871 ) effectively set the scene for the push towards the finite mixture model- 
based approaches that followed. Overview s of m i xture models and t h eir ap - 
plications are given i n lEyeritt and Hand! (119811). ITitterington et al.l ( 119851 ). 
McLachlan and Peel! (l2000h . and IFruhwirth-Schnatterl (l2006h . 



Consider a random vector (X', Y)', defined from Q to MP x K, where a la- 
tent group-structure as well as a linear dependence of Y on x in each group 
are assumed. Under t hese assump t ions, the linear cluster-weighted model 
(CWM; introduced in iGershenfeldl . Il997l ) is an ideal choice within the mix- 
ture modelling framework. It factorizes the joint density of (X', Y) , in each 
mixture-component, into the product of the conditional density of Y\x and 
the marginal density of X. In this manner, the model takes into a c count 



the potential of finite mixtures of regressions (see IFruhwirth-Schnatterl . 12006 



Chapter 8) in modelling the conditional densit y of Y\x, and the pote ntial 



of finite mixtures of Gauss ian distributions (see ITitterington et al.l . Il985l and 
McLachlan and Peel . 2000l ) in modelling the joint density of (X', Y)' and the 
marginal density of X. 



Recent literature on model-based c lustering and classification through the 
CWM can be summarized as follows. Ilngrassia. Minotti. and Vittadinil (120121 ) 
study the relationships between the linear Gaussian CWM and some well- 
known mixture-based approaches, moreover they consider the t-distribution 
as a robust al ternative to Gaussian assumpt i ons. By using this model as a 
building block. Ilngrassia. Minotti. and Punzol ( 120121 ) introduce a family of par- 
simonious linear t-CWMs for model-based clustering . Finally, und er Gaussian 
assumptions for both mixture-component densities, iPunzol (120121 ) introduces 
the polynomial CWM as a flexible tool for clustering and classification pur- 
poses 



However, the applicability of linear Gaussian CWMs in high dimensional X- 
spaces still remains a challenge. The number of parameters for this model is 
(G - 1) + G (p + 2) + G [p + p (p + 1) /2], of which Gp (p + 1) /2 are used for 



the group covariance matrices S s of X alone, g 



G, and this increases 



quadratically with p. To overcome this issue, we assume a latent Gaussian 
factor structure for X, in each mixture-component, which leads to the Fac- 
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tor Regression Mod e l (FR M) of Y on x (see IWestl . 12003 . IWang et all . 12007 



and ICarvalho et all . l2008h . The FRM assumes £ 5 = A 5 A' + * g , where the 



loading matrix is a p x q matrix of parameters typically with q <C p and the 
noise matrix ty g is a diagonal matrix. The adoption of this group covariance 
structure in the linear Gaussian CWM framework leads to the linear Gaus- 
sian cluster- weighted factor analyzers model (CWFA), which is characterized 
by G [pq — q (q — 1) /2] + Gp parameters for the group covariance matrices. 
The CWFA model follows the principle of the general form of mixtures of 
factor analyzers rega r ding X; mixtures of factor analyzers were de veloped by 
McLachlan and Peel! (|20fl0j, Chapter 8) and[McLachlan et ap (l200a) on the ba- 
sis of the original work of Ghahramani and Hinton ( 19871) . Furthermor e , start - 
ing from the works of iMcNicholas and Murphvl (l2008h. iMcNicholasI fcoioh. 



Ingrassia. Minotti. and Vittadinil ( 120121 ) and llngrassia. Minotti. and Punzol (120121 ). 
a novel family of sixteen mixture models — obtained as special cases of the 
linear Gaussian CWFA by conveniently constraining the component variances 
of Y and X — is introduced to facilitate parsimonious model-based clustering 
and classification in the defined paradigm. 



The paper is organized as follows. Section [5] recalls the linear Gaussian CWM 
and the FRM (with details given in Appendix |A]); they are the basic models 
to define the linear Gaussian CWFA models introduced in Section [3j Model 
fitting with the alternating expectation-conditional maximization (AECM) al- 
gorithm is presented in Section HJ with details given in Appendix IB. 2 1 Section [5] 
addresses computational details on some aspects of the AECM algorithm and 
discusses model selection and evaluation. Artificial and real data are consid- 
ered in Section [61 and the paper concludes with discussion and suggestions for 
further work in Section [7J 



2 Model 



This section provides a step-by-step introduction to the model we introduce 
in the next section. 



2.1 The linear Gaussian cluster- weighted model 



Let p (x, y) be the joint density of (X', Y) . Suppose that Q can be partitioned 
into G groups, say Q±, . . . , Qg- The CWM defines the joint density as 

G 

P {x, y;0) = Y^ K g P (y\x, Q g ) p (x\Sl g ) , (1) 

9=1 
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where p(y\x, Q g ) is the conditional density of the response variable Y given x 
and Q g , p(x\Q g ) is the marginal density of x given Q g , ir g = p(fl g ) is the weight 



of Q g in the mixture (defined so that ix g > and J2 g 7T g 
and 6 contains all of the parameters in the mixture. 



1), 9 



,G, 



The component densities p(x\fl„ ) and p (y\x, fl 9 ) are usually assumed to be 
(multivariate ) Gaussian (see, e.g.. Ilngrassia. Minotti. and Vittadinil . 12012 and 
Punzol . 120121 ). the former with mean vector fi g and covariance matrix S 9 and 

the latter with linear conditional mean // (x, (3^ = [3o g + f3' lg x and conditional 
variance a 2 , where (3 = (/3o g ,{3' la ) , (3o g G R, and (3 lQ G R p . In other words, 



conditional on x and fi 9 , the linear model Y\x 



fi [x, f3 g J + e g holds. Thus, 
the general CWM in ([!]) becomes the linear Gaussian CWM 



G 



3=1 



P (oj, y;9) = J2 ^ (y\ x > ^ ( x > Pg) > °g) <f> ( x ' V g , s 9 ) , ( 2 ) 

where 

(f) (y\x]fi {x;P g ) ,°\ 
v 35 ) 53 9 



exp 



y-/i(x; (3 g ) 



£.,2 T/ie factor regression model 



The factor analysis model (jSpearmanl . Il904j and iBartlettl . Il953l ). for the p- 
dimensional variable X, postulates that 



X = /x + At/ + e, 



(3) 



where U ~ iV 9 (0, J g ) is a g-dimensional (g <C p) vector of latent factors, A is a 
pxq matrix of factor loadings, and e ~ N p (0, *Sf), with = diag • • • , V'p) > 
independent of Z7. Then X ~ N p (fJt, AA' + V&) and, conditional on results 
in X|tt ~ AT p (// + Aw, 

Model ([3]) can be considered similarly to the st andard (linear) regression model 
Y = n + 0',X + e leadin g to the FRM (see IWestl . 120031 . IWang et all . 120071 . 
and ICarvalho et al.l . 120081 ) 

Y = (3 + (3[ (fi + AU + e) + e = % + /3' lt i) + (3[AU + ((3[e + e) , 

where e is assumed to be independent of U and e. The mean and variance of 
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Y are given by 



E(F)=/3 + ft^ 
Var (Y) = Var ((3[AU) + Var (f3[e) + Var (e) 

= ^AA^j + ft*ft + a 2 = ft (AA' + tt) ft + a 2 , 

respectively, and so F ~ iV (/3 + /3[fi, ft (AA' + ft + <x 2 ). 
Consider the triplet (Y, X', [/') . Its mean is given by 





Y 




0o + Pit* 


E 


X 








u 








and because Cov(X, Y) = (AA' + *)ft and Cov(C7, F) = A'ft, it results 





Y 




ftEft + 


(T 2 ftE ft A 


Cov 


X 




Eft 


£ A 




u 




A'ft 


A' J g 



where E = AA' + ^ . Now, we can write the joint density of (F, X' , U') as 

p(y,x,u) = (j){y\x,u)(j){x\u)(j){u) . (4) 

Here, the distribution and related parameters for both X\u and U are known. 
Thus, we need only to analyze the distribution of Y\x, u. Importantly, E (Y\x, u) = 
E (Y\x) and Var(F|a;,u) = Vai(Y\x), and so Y\x,u ~ N (/3 + /3[x, a 2 ); 
mathematical details are given in Appendix |A] This implies that (j) (y\x, u) = 
4>(y\x) and, therefore, Y is conditionally independent of U given X = x, so 
that dlj becomes 

p(y,x,u) = ^(y|aj)0(aj|u)0(«). (5) 

Similarly, U\y, x ~ iV (7 (x — fjt) , I q — 7A), where 7 = A' (AA' + and 
thus U is conditionally independent on Y given X — x. Therefore, 

E [U\x; fj,, A, *] = 7 (x — fi) , and 

E [[/[/>; /x, A, *] = J 9 - 7A + 7 (x - /x) (x - »)' 7'. 
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3 The modelling framework 



3. 1 The general model 



Assume that for each £l g , g = 1, . . . , G, the pair (X', Y) satisfies a FRM, that 
is 

Y = /3 0g + P' lg X + e g with X = n g + A g U g + e g , (6) 
where A g is a p x g matrix of factor loadings, £/ 9 ~ N q (0, J 9 ) is the vector 
of factors, e g ~ N p (0, Sf^) are the errors, ty g = diag (^lp, ■ ■ ■ , Vw)> anc ^ £ 9 ~ 
A(0,c^). Then the linear Gaussian CWM in ([2]) can be extended in order to 
include the underlying factor structure ([6]) for the X variable. In particular, 
by recalling that Y is conditionally independent of U given X = x in the 
generic Q g , we get 

G 

V {x, y;0) = Yl n 90 {v\ x > V ( x ; Pg) ' °fj ( X i Vg, A 9 Ag + , (7) 

9=1 

where = yK g , (3 g , o~ g , fi g , A g , *& g ; g — 1, . . . , G|. Model (j7|) is the linear Gaus- 
sian CWFA, which we shall refer to as the CWFA model herein. 



3.2 Parsimonious versions of the model 



In this section, we extend the linear Gaussian CWFA by allowing constraints 
across groups on a g , A g , and *& g , and on whether or not ^ g = ip g I p (isotropic 
assumption). The full range of possible constraints provides a family of sixteen 
different parsimonious CWFAs, which are given in Table [TJ 



Here, models are identified by a sequence of four letters. The letters refer to 
whether or not the constraints a g = a 2 , A g = A, ^ g = and ^ g = ip g I p , re- 
spectively, are imposed. The constraints on the group covariances of X are in 
the spirit of lMcNicholas and Murph y (200 8]), while that on th e group variances 
of Y are borrowed from llngrassia. Minotti. and Punzol (120121 ). Each letter can 
be either C, if the corresponding constraint is applied, or U if the particularcon- 
straint is not applied. For example, model CUUC assumes equal Y variances 
between groups, unequal loading matrices, and unequal, but isotropic, noise. 



3.3 Model-based classification 



Suppose that m of the n observations in S are labeled. Within the model- 
based classification framework, we use all of the n observations to estimate 
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Table 1 

Parsimonious covariance structures derived from the CWFA model. 



A f „ J „1 TFi 

Model 1JJ 


Y Variance 


Loading Matrix 


Error Variance 


Isotropic 


Covariance parameters 
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unconstrained 


unconstrained 


unconstrained 


G + G [pq - 


-<?(<?- 


- 1) /2] + Gp 


uuuc 
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G + G[pq 


-<?(<? 
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q (q - 


1) /2] + G 
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G + [pq - 


- q (g - 


-l)/2]+p 
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G+[pq- 


- g {g - 


"l)/2] + l 


cuuu 
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1 + G[pq - 


-g{g- 


-l)/2] + Gp 


cuuc 
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unconstrained 


unconstrained 


constrained 


1+G[pq 


-g{g 


-l)/2]+G 


cucu 
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1 + G [pq 


-g{g 


-l)/2]+P 


cucc 
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1 + G [pq 


-q(q 


-l)/2] + l 
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1 + [pq - 


g (g - 


1) /2] + Gp 
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l + [pq- 
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cccu 


constrained 


constrained 


constrained 


unconstrained 


i + [pg - 


g(g- 


l)/2]+P 
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1 + [pq - 


g (q - 


l)/2] + l 



the parameters in (j7|); the fitted model classifies each of the n — m unla- 
beled observations through the corresponding maximum a posteriori proba- 
bility (MA P). As a special ca s e, if m = 0, we obt a in the clustering scenario. 
Drawing on lHosmer Jr.l (119731 ). iTitterington et al.l (119851 Section 4.3.3) point 
out that knowing the label of just a small proportion of observations a priori 
can lead to improved clustering performance. 



Notationally, if the zth observation is labeled, denote with Zi = (za, . . . , Zic) 
its component membership indicator. Then, arranging the data so that the 
first m observations are labeled, the complete-data likelihood becomes 



m G 

L c (0) = II II (Vi\ x i'i P \ X 'i Pg) ' a g) (»<!«»; M 9 , A 5 , * S J (f) (Uig) 
i=lg=l 
n G 

x n n (j/^'i (^i ^) - ff 9 2 ) ^ ^i^i *s) ^ 

i=m+l g=l 



For notational convenience, in this paper we prefer to present the AECM 
algorithm in the model-based clustering paradigm (cf. Section H]). However, 
the extension to the model-based classification context is simply obtained by 
substituting the 'dynamic' (with respect to the iterations of the algorithm) 
Zi, . . . , z m with the "static" Zi, . . . , z m . 
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4 Parameter Estimation 



4.1 The AECM algorithm 



The AECM algorithm ( jMeng and van Dykl . 119971 ) is used for fitting all the 
models within the family defined in Section [TJ This algorithm is an exten- 
sion of the expectation-maximization (EM) algorithm ([Dempster et all Il977l ) 
that uses different specifications of missing data at each stage. Let S = 
{(a?-, x/i)' ; i = 1, . . . , n j be a sample of size n from (17j) . In the EM framework, 

the generic observation (x^jji) 1 is viewed as being incomplete; its complete 
counterpart is given by (x' i: y^ u' ig , z'A , where Zi is the component-label vec- 
tor in which zi g = 1 if (acj, yi)' comes from Vt g and zi g = otherwise. Then the 
complete-data likelihood, by considering the result in fl5]), can be written as 



LAO) 



n G 

II II (Vi\ x i\ V Pg) > °f) <t> (»<|«<; fig, A 9 A^ 



=13=1 



[U 



1 1) > 



The idea of the AECM algorithm is to partition 6, say 0\ = {0'i,0'^) , in 
such a way that the likelihood is easy to maximize for 6\ given 62 and vice 
versa. The AECM algorithm consists of two cycles, each containing an E-step 
and a CM-step. The two CM-steps correspond to the partition of 6 into X 
and 02- Then, we can iterate between these two conditional maximizations 
until convergence. In the next two sections, we illustrate the two cycles for 
the UUUU model only. Details on the other models of the family are given in 
Appendix [Bl 



4-2 First cycle 



Here, 6\ = ^K g , (3 g , fi g , a g ; g — 1, . . . , Gj, where the missing data are the un- 
observed group labels Zi, i — 1, . . . , n. The complete-data likelihood is 



n G 



L l = II II (Vi\ X il A 4 ( X h Pg) , of) ^ ( X *l Vg, 



i=lg=i 
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Consider the complete-data log-likelihood 



G 



hi (Oi) = E E In [ir g (j) (vilxf, /i (xi\ f3 g ) , afj <p (xf, n g , A g , * g ) 

i=l 9=1 



n (p + 1) 1 " ° 2 . x . x . 



2 

n G 



i=l S =l 
n G 



i=l g=l 



i=l g=l 



i=lg=l 



9 E E *9 111 l S 9l - o E E *9 - A* fl ) S ^ (*i - M 9 ) + E % 111 Tfl. 

Z —1 — 1 Z — 1 g=l 



where % = E ^s- Because S ff = AA ff + \I> 9 , we get 



i=i 



1 



n(p+ 1) 



n G 



lll27r - oEE^S lllCT ff 



2 2 
» - (//,-• >u ;l inG 



i=l g=l 
, 2 



9 E E 

Z i=l <?=! 



-9EE^ ln aa; + * 



i=i 9 =i 



n G , / G 

9 E E ^9 tr - Mff) - Ms) ( A 9 A 'g + + E n 9 ln ^ff" 

Z i=l o=l <• J .9=1 



The E-step on the first cycle of the (A; + l)st iteration requires the calculation 
of Qi (O^d^^j = E^(fc) [Z c (0i) |«S], which is the expected complete-data log- 
likelihood given the observed data and using the current estimate 6^ for 0. In 
practice, it requires calculating E^o) [Zj 9 |«S]; this step is achieved by replacing 

each Zig by z^ +l \ where 



vrf (yjg; /i gf) , jW) fcjgg), A<*>, *f ) 

E V fc; $ fc) ) > 4 k) ) * A f > *f ) 
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For the M-step, the maximization of this complete-data log-likelihood yields 



(fc+i) 



7T 



„(*+!) 



/a(fc+l) 
Pig 



o(k+l) 
POg 

2(fe+l) 



(T 



1 n 



n 



E4 



(*+i) 



in 



X j 



9 i=l 



n, 



E 



(fe+i) , 



a i=i 

1 n 



n 



E 

9 i=l 



- ,/(*+!) If (fe+1) 



n 



9 i=l 



^ti i+i, {!/.-« +i, +/3'r , -.)} : 



9 i=l 



where n g k+1 ^ = ^2 z ig Following the notation in lMcLachlan and Peel! ( 2000 ). 



i=i 



we set e^ k+l ^ = {of +l \e { 2 k) ). 



4-3 Second cycle 

Here, 6 2 = g — 1, . . . , G} = {A 9 , *& g ; g = 1, . . . , G}, where the missing 
data are the unobserved group labels Zj and the latent factors u ig , i — 1, . . . , n 
and g — 1, . . . , G. Therefore, the complete-data likelihood is 

n G 

l c2 {o 2 ) = n II [<i> (vi\ x u a* P { 9 k+1) ) . °T +1) ) <t> s 9) </>K>? +1) * <9 

i=lg=l 
n G 

i=l g=l 

because Y is conditionally independent of U given X = x and 

(a^K; fi g k+1 \ * 9 ) = j^-p exp {-- (a* - - A^)' 1 (a* - n g k+1) ~ A ff w iff )} 

^ = (2^W ^ l"^"*} • 
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Hence, the complete-data log-likelihood is 



I* W = + In (2vr) - \ ± £ z i9 In af 

Z Z i=l 9 =1 



2 

n G 



i=l 9 =1 

where we set 



Z i=l 9=1 ZIJ 9 9=1 Z i=l 9=1 

n G 

2 



1 n G , -J 

2 £ E ^ tr { 0* - /4 fc+1) - A ^ 9 ) («* - /4 fc+1) - A ^ s ) 



Cf(*+1) _ 1 V 7 (k+1) (t. - ,i( k+1 A /V - l#( fc+1 >Y 
J 9 - ( fc+ i) Z iS P ^9 V 1 ^9 ) ■ 

Tig i=l 



The E-step on the second cycle of the (k + l)st iteration requires the calcula- 
tion of Q 2 {0 2 ; 9 {k+1/2) ) = E (fc+1/2) [l C 2 (62) \S]. Therefore, we must calculate 
the following conditional expectations: E«(*+i/2) (Z ig \S), E«(*+i/2) (Zj fl Ui g |<S), 

and Efl(*+i/2) {z ig U ig U' ig \S^. Based on (12.21) . these are given by 



(k+1) 

9 



E 0(fc+1/2 ) = i fc+1) {l 9 - TfAf + T^T™} = i fc+1) ©9 fc) ^ 

where 



= A /(*) f A W A 'M 4. *>(*0 N ~ ' 
'0 V 



ef =i g - 7 w A w +7 w 5 (^) 7 f). ( 9 ) 

Thus, the gth term of the expected complete-data log-likelihood Q2 (0 2 ; #( fe+1 / 2 ) 
becomes 



Q 2 (A9, ^ (fc+V2) ) = C(0[ k+1) ) + In |*- 1| - I^tr {S^*; 1 } 

(10) 



where C denotes the terms in (14.31) that do not depend on 6 2 . Then 
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( TTUj) is maximized for {A, satisfying 



<9Q 



2 



<9A 9 

^ 2 _ i„(fe+l)vl> _ !„(*+!) «?(*+!) i „(Hl)o'(Hl) _ !„(*+!) A fiWA' _ n 

~~ 2 9 3 2 9 9 9 9 9 9 2 9 9 9 

Therefore, 



W-A 9 0f = O (11) 

* 9 - + 2s; (fe+i) 7 ; (fe) a; - A g ©« a; = o. (12) 

From (TTTT) . we get 

a, = s( fe+1 ) 7 f >e,-\ (is) 

and substituting in ffT2l we get 

- + 2 s<* 1 > 7 f) (sf-^f )e; 1 )' - (s^e; 1 ) e 9 (s^e; 1 )' = o 

which yields 

* g = di ag {S g k ^-A g %S g k ^}. (14) 

Hence, the maximum likelihood estimates for A and are obtained by itera- 
tively computing 

A+ = 5(fe+D 7 ^0 g 1 

4 = dilg{5H ) -A 3 + 7^}, 

where the superscript + denotes the update estimate. Using and we 
get 



0+ 



:A' + 

9 



A + A' + 

9 a 



9 



-1 



7 + A + + 7 + 5 (fc+1) y+. 

1 q q 1 I q q 1 q 



(15) 



^.^ Outline of the algorithm 



In summary, the procedure can be described as follows. For a given initial 
guess 0^°\ on the (k + l)st iteration, the algorithm carries out the following 
steps for g — 1, . . . , G: 

(1) Compute n g k+1 \fi g k+1 \f3 g k+1 \af k +^; 

(2) Set A g <- A g h) and * <- *b g k \ and compute -y g and 9 ; 
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(3) Repeat the following steps until convergence on A g and Sf^: 

(a) Set A+ «- flf^ej 1 and *J «- diag{s( fc+1 > - A+t,^ 1 )}; 

(b) Set 7 + <- A' g + (A+A; + + *+)~ 1 and 0+ <- I-^+A^+S^ 1 ^ 

(c) Set A g <- A+ <- *+ 7 9 <- 7+ and @ g <- 0+ 



^.5 AECM initialization: a 5-step procedure 



The choice of starting values is a well known and important issue with respect 
to EM-based algorithms. The standard approach consists of selecting a value 
for 6^°K An alternative method, more natur al in the authors' opinion, c onsists 
of choosing a value for z\ , i — 1, . . . , n (see iMcLachlan and Peeilioool . p. 54). 



Within this approach, and due to the hierarchical structure of the CWFA 
family of parsimonious models, we propose a 5-step hierarchical initialization 
procedure. 

,(o) 



For a fixed number of groups G , let z\ , i = 1, ...,n, be the initial clas- 

4 0) G {0,1} and E,4 0) 



sification for the AECM algorithm, so that z$ G {0, 1} and J2 q z ig — 1- 



The set {^^ ;i = 1, . . . ,n| can be obtained either through some clustering 
procedure (here we consider the /c-means method) or by random initializa- 
tion, for example by sampling from a multinomial distribution with prob- 
abilities (1/G, . . . ,1/G). Then, at the first step of the procedure, the most 
constrained CCCC model is estimated from these starting values. At the 
second step, the resulting (AECM-estimated) z ig are taken as the starting 
group membership labels to initialize the AECM-algorithm of the four mod- 
els {UCCC, CUCC, CCUC, CCCU} obtained by relaxing one of the four con- 
straints. At the third step, the AECM-algorithm for each of the six models 
{CCUU, CUCU, UCCU, CUUC, UCUC, UUCC} with two constraints is ini- 
tialized using the Zi g from the previous step and the model with the highest 
likelihood. For example, to initialize CCUU we use the Zi g from the model 
having the highest likelihood between CCCU and CCUC. In this fashion, the 
initialization procedure continues according to the scheme displayed in Fig. [U 
until the least constrained model UUUU is estimated at the fifth step. 



For al l of the models in the CWFA family, in analogy with lMcNicholas and Murphy 



(120081 ). the initial values for the elements of A g and *& g are generated from the 
eigen-decomposition of S g as follows. The S g are computed based on the values 
of z\g . The eigen-decomposition of each S g is ob tained using the H ouseholder 



reduction and the QL method (details given by lPress et all Il992l ). Then the 
initial values of the elements of A g are set as Ay = ^JdjPij, where dj is the 
jth largest eigenvalue of S g and pij is the ith element of the eigenvector cor- 
responding to the jth largest eigenvalue of S g , where i G {1,2, ... ,d} and 
j € {1,2,..., q}. The ty g are then initialized as g = diag (^S g — A g A' g 
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cccc 




cccu ccuc cucc uccc 




ccuu cucu uccu cuuc ucuc uucc 




cuuu ucuu uucu uuuc 




uuuu 



Fig. 1. Relationships among the models in the 5-step hierarchical initialization 
procedure. Arrows are oriented from the model used to initialize to the model to be 
estimated. 

4-6 Convergence criterion 



The Aitken acceleration procedure (lAitkenl . Il926l ) is used to estimate the 
asymptotic maximum of the log-likelihood at each iteration of the AECM 
algorithm. Based on this estimate, a decision is made about whether the algo- 
rithm has reached convergence, i.e., whether the log-likelihood is sufficiently 
close to its estimated asymptotic value. The Aitken acceleration at iteration 
k is given by 

/(fc)_/(fc-i)' 

where l^ k+1 \ l^ k \ and Z( fc_1 ) are the (observed-data) log-likelihood values from 
iterations k + 1, k, and k — 1, respectively. Then, the asymptotic estimate of 
the log-likelihood at iteration k + 1 is 



j(k+l) _ ^k)\ 



(IBohning et al.l . I1994J ) . In the analyses in Se c tion El we stop our algo rithms 



when 
that we use e 



l {k) < e (IBohning et all Il994! ; McNicholas et all BOlOl ). Note 



0.05 for the analyses herein. 
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5 Model selection and performance assessment 



5.1 Model selection 



The CWFA model, in addition to 6, is also characterized by the number of 
latent factors q and by the number of mixture components g. So far, these 
quantities have been treated as a priori fixed. Nevertheless, the estimation of 
these is required, for practical purposes, when choosing a relevant model. 



For model-based clustering and classification, several mo del select i on cr iteria 
are used, such as the Bayesian informa tion criterion (BIC:ISchwara . ll978l ). the 



integrated completed likelih ood (1CL; iBiernacki et al.l . 120001 ) , and the Akaike 



information criterion (AIC; Sakamoto et al. , 19831 ). Among these, the BIC is 
the most predominant in the literature and is given by 



BIC = 21 (6) -rjki (n) , 

where I (d^j is the (maximized) observed-data log-likelihood and rj is the num- 
ber of free parameters. This is the model selection criterion used in the analyses 
of Section El 



5.2 Adjusted Rand index 



Although the data analyses of Section [6] are mainly conducted as clustering 
examples, the true classifications are actu ally known for these data. In these 
examples, the adjusted Rand index (ARI; Hubert and ArabieJ . 1985) is u sed 
to measure class agreement. The original Rand Index (RI; iRandl . 119711 ) is 
based on pairwise comparisons and is obtained by dividing the number of 
pair agreements (observations that should be in the same group and are, plus 
those that should not be in the same group and are not) by the total number 
of pairs. The ARI corrects the RI to account for agreement by chance: a 
value of '1' indicates perfect agreement, '0' indicates random classification, and 
negative values indicate a classification that is worse than would be expected 
by guessing. 



6 Data analyses 



This section presents the application of the family of parsimonious linear Gaus- 
sian models to both artificial and real data sets. Code for the AECM algo- 
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rithm, described in this paper, wa s written in the R computing environment 
(|R Development Core Team! 120121 ). 



6.1 Simulated data 
6.1.1 Example 1 

The first data set consists of a sample of size n = 175 drawn from model UUCU 
with G — 2, nx — 75, n 2 = 100, d = 5, and q = 2 (see Fig. |2] for details). 
The parameters used for the simulation of the data are given in Table |2] (see 
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Fig. 2. Example 16. 1.1 1 scatterplot matrix of the simulated data. 
Appendix IC. II for details on the covariance matrices S ff , g — 1, . . . , G). 

All of the sixteen CWFA models were fitted to the data for G G {2, 3} and 
q G {1,2}, resulting in a total of 64 models. As noted above (Section 14.51) . 
initialization of the Zi, i = 1, . . . , n, for the most constrained model (CCCC), 
and for each combination (G,q), was done using the k- means algorithm ac- 
cording to the kmeans function of the R package stats. The remaining 15 
models, for each combination (G,q), were initialized using the 5-step hierar- 
chical initialization procedure described in Section 14.51 The BIC values for 
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Table 2 

True and estimated parameters for the simulated data of Example 1. 

(a) Means of X 









»9 
















Group 


Xi 


x 2 


x 3 


x A 


x 5 


Xt 


x 2 


x 3 


X A 


x 5 


1 


14.00 


18.00 


25.00 


14.00 


22.00 


15.88 


19.94 


27.48 


15.81 


23.93 


2 


-12.00 


-10.00 


-22.00 


-20.00 


-22.00 


-11.95 


-10.36 


-22.00 


-19.67 


-22.03 



(b) Slopes 



Group 


Xi 


x 2 x% 


X A 


x 5 


X\ X 2 


K 

x 3 


X A 


x 5 


1 


0.47 


0.02 0.42 0.03 


0.87 


0.50 0.03 


0.46 


0.02 


0.81 


2 


-0.02 


-0.63 -0.05 -0.85 


-0.03 


-0.04 -0.57 


-0.01 


-0.85 


-0.18 




(c) Conditional std. 
tions 


devia- 




(d) Intercepts 










Group 






Group 


A)g 










1 


2.00 


1.24 


1 


4.50 


4.34 








2 


4.00 


3.79 


2 


-4.20 


-6.35 







all 64 models were computed and the model with the largest BIC value was 
selected as the best model. In this example, the model corresponding to the 
largest BIC value (-5845.997) was a two component (G = 2) UUCU model 
with two latent factors (q = 2), the same as the model used to generate 
the data. The selected model gave a perfect classification and the estimated 
parameters were very close to the parameters used for data simulation (see 
Table [2] and Appendix IC.lj) . 

Fig. [3] shows the BIC values of all 64 models sorted in an increasing order, 
where numbers denote the selected number G of groups and colours denote 
the number q of latent factors. The horizontal line separates the models with 
a BIC value within 1% of the maximum (over the 64 models) BIC value 
(hereafter simply referred to as the '1% line'). This graphical representation 
will be referred to as the 'group-factor plot' of BIC values. Here, as mentioned 
earlier, the model with the largest BIC was UUCU (with G = 2 and q = 2). 
The subsequent two models, those above the 1% line, were UUUU with G = 2 
and q = 2 (BIC equal to -5867.006) and CUCU with G = 2 and q = 2 (BIC 
equal to —5869.839). These two models are structurally very close to the true 
UUCU model and also yielded perfect classification. It should also be noted 
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Fig. 3. Group- factor plot of BIC values sorted in an increasing order for Example 1. 
Numbers denote the selected number G of groups and colours denote the number 
q of latent factors (red: q = 1, blue: q = 2). The green number indicates the true 
model. 

that most of the models with high BIC values have G = 2 and q — 2. 



6.1.2 Example 2 

For the second data set, a sample of size n = 235 was drawn from the CUUC 
model with G = 3 groups (of size n\ = 75, n 2 = 100, and = 60) and q = 2 
latent factors (see Fig. HJ) . 

All 16 CWFA models were fitted to the data for G G {2, 3, 4} and q 6 {1, 2}, 
resulting in 96 different models. The algorithm was initialized in the same way 
as for Example 2. The model with the highest BIC (-6579.116) was CUUC 
with G = 3 and q = 2, resulting in a perfect classification. The estimated 
parameters of this model were very close to the true ones (Table [3] and Ap- 
pendix [C72]). 

Fig. [5] shows the group-factor plot of BIC values for all 96 models. The other 
three models above the 1% line are UUUC (BIC= -6583.692), CUUU (BIC= 
-6637.222), and UUUU (BIC= -6641.798), all with G = 3 and q = 2. Thus, 
these models are congruent, with respect to the true one, in terms of G and 
q. Moreover, they had a covariance structure more similar to the true one 
(CUUC) and also yielded perfect classification. 
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Fig. 4. Scatterplot matrix of the simulated data for Example 2. 
6.2 The f. voles data set 



In addition to the simulated data analyses of Section I6.ll the family of CW- 
FAs was also applied to a real data set for both clustering and classification 
purposes. 



The f. voles data set, detailed in iFluryl (119971 Table 5.3.7) and available 
in the Flury package for R, consists of measurements of female voles from 
two species, M. californicus and M. ochrogaster. The data consist of 86 ob- 
servations for which we have a binary variable Species denoting the species 
(ni = 45 Microtus ochrogaster and n\ = 41 M. californicus), a variable Age 
measured in days, and six remaining variables related to skull measurements. 
The names of the variables are the sam e as in the original analysis of this 
data set by lAiroldi and Hoffmannl (119841 ): l_2 = condylo-incisive length, Lg = 



length of incisive foramen, L7 = alveolar length of upper molar tooth row, B3 - 
zygomatic width, B4 = interorbital width, and Hi = skull height. All of the 
variables related to the skull are measured in units of 0.1 mm. 



The purpose of lAiroldi and Hoffmannl (1 19841 ) was to study age variation in 
M. californicus and M. ochrogaster and to predict age on the basis of the 



19 



Table 3 

True and estimated parameters for the simulated data of Example 2. 

(a) Means of X 



Group 


Xi 


x 2 


»g 

Xz X^ X5 


Xt 


x 2 


»9 

x 3 


X A 


x 5 


1 


0.00 


0.00 


-5.00 0.00 -4.00 


0.82 


0.48 


-5.09 


-0.21 


-3.75 


2 


14.00 


18.00 


25.00 14.00 22.00 


13.64 


17.44 


25.44 


14.25 


21.44 


3 


-12.00 


-10.00 


-22.00 -20.00 -22.00 


-12.33 


-10.22 


-22.25 


-20.24 


-22.21 








(b) Slopes 












Group 


x 1 


x 2 


Pig 

X% X4 X5 X\ 


x 2 


K 

x 3 


x 4 


x 5 




1 


-0.41 


-0.87 


-0.22 -0.62 -0.06 -0.34 


-0.82 


-0.32 


-0.66 


-0.09 




2 


0.47 


0.02 


0.42 0.03 0.87 0.51 


0.00 


0.38 


0.05 


0.84 




3 


-0.02 


-0.63 


-0.05 -0.85 -0.03 -0.04 


-0.68 


-0.36 


-0.44 


-0.18 





(c) Conditional std. devia- 
tions ( d ) Intercepts 



Group 


a 9 


°9 


Group 


fog 




1 


2.00 


2.30 


1 


30.00 


29.39 


2 


2.00 


2.30 


2 


4.50 


5.31 


3 


2.00 


2.30 


3 


-4.20 


-6.69 



skull measurements. For our purpose, we assume the data are unlabelled with 
respect to Species and that our are interest is in evaluating clustering and 
classification using the family of CWFA models as well as comparing the algo- 
rithm with some well-established mixture model-based techniques. Therefore, 
Age can be considered the natural Y variable and the d = 6 skull measure- 
ments can be considered as the X variable for the CWFA framework. 



6.2.1 Clustering 

All sixteen linear Gaussian CWFA models were fitted — assuming no known 
group membership — for G G {2, . . . , 5} components and q G {1, 2, 3} latent 
factors, resulting in total of 192 different models. The model with the largest 
BIC value was CCCU with G = 3 and q = 1, with a BIC of -3837.698 and 
an ARI of 0.72. Table 4(a) displays the clustering results from this model. 
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Fig. 5. Group-factor plot of BIC values sorted in increasing order for Example 2. 
Numbers denote the selected number G of groups and colours denote the number 
q of latent factors (red: q = 1, blue: q = 2). The green number indicates the true 
model. 



Furthermore, Table 4(b) and Table 4(c) show, respectively, the clustering re- 
Table 4 

Clustering of f .voles data using three different clustering approaches. 



(a) CWFA 



(b) PGMM 



Est. 

TRUE\-^ 


1 


2 


3 


Est. 

TRUE^^~~~~-~^ 


1 


2 


3 


ochrogaster 


24 


21 




ochrogaster 


34 


9 


2 


californicus 






41 


californicus 






41 



(c) MCLUST 



Est. 


1 




2 


TRUE~~--^ 




ochrogaster 


43 


2 


californicus 




41 



suits of the following model-based clustering approaches applied to the vector 
(X',Y)': 
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P GMM : parsimonious late nt Ga ussian mixture models as described in lMcNicholas and Murphy 
(l2008h . lMcNicholasl fl2010h . and lMcNicholas et all fcoioh. and e stimated via 



the pgmmEM function of the R-package pgmm (IMcNicholas et al.l . 120111 ); and 



MCLUST: parsimonious mi x tures of Gaussian distribut ions as described in 



Banfield and Rafteryl (ll993f ). ICeleux and Govaertl ( 119951 ). and lFraley and Raftery 
(2002 ), and estima ted via the Mclust function of the R-package mclust (see 



Fralev et al.l . I2012L for details) 



As seen from Table HJ M. calif ornicus was classified correctly using all three 
approaches. Also, M. ochrogaster was classified into two sub-clusters using 
CWFA and PGMM while MCLUST classified it into one cluster. However, the 
CWFA approach had no misclassifications between the two species but both 
PGMM and MCLUST misclassified two M. ochrogaster as M. calif ornicus. 




Fig. 6. Group-factor plot of BIC values sorted in an increasing order for the f . voles 
data. Numbers denote the selected number G of groups and colors denote the num- 
ber q of latent factors (red: q = 1, blue: q = 2, green: q = 3). 

Now, we evaluate the group-factor plot of BIC values, for all 192 models, dis- 
played in Fig. [6j Ten models had a BIC above the 1% line; among them, 
six were characterized by G = 3 components and the remaining four by 
G = 2. However, from Fig. El the top four models all had three compo- 
nent s, which shows that a three component model was not randomly cho- 



sen. 



Airoldi and Hoffmann! ( 119841 ) mention that some unexplained geographic 
variation may exist among the voles. However, no covariate was available with 
such information. Hence, we opted for the scatter plot matrix to evaluate the 
presence of sub-clusters, see Fig. [7J Here, the scatter plot of the variables B 3 
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Fig. 7. Scatterplot matrix of f .voles data showing the classification observed from 
CWFA modelling using the clustering framework, where black and red indicates 
sub-clusters of M. ochrogaster species and green indicates M. californicus species. 

versus B 4 shows the presence of distinct sub-clusters for M. ochrogaster, which 
supports our results attained using CWFA modelling. 



6.2.2 Classification 



A subset of observations, consisting of 50% of the data, was randomly selected 
and these observations were assumed to have a known group membership. To 
allow for the unobserved sub-cluster noted in the clustering application of 
Section I6.2.1[ we ran the algorithm for G = 2, 3 and q — 1,2, 3. The best 
model (CCUU with G = 2 and q = 1) selected by the BIC (-3843.482) gave 



a perfect classification, as we can see from Table 5(a) 



Table 5 

Classification of f .voles data assuming that 50% of the observations have known 
group membership. 



(a) 2 known groups 



(b) 3 known groups 



Est. 

TRUE~~-\^ 


1 


2 


Est. 

TRlJE^^~~~-~^^ 


1 2 


3 


ochrogaster 


45 




ochrogaster 


28 17 




californicus 




41 


californicus 




41 



23 



We also ran the classification assuming that the data are actually comprised 
of three known groups. Therefore, using the classification observed by cluster- 
ing, we also ran the classification algorithm with 50% known (i.e., labelled) 
and 50% unknown (i.e., unlabelled). To further allow for the unobserved sub- 
cluster, we ran the algorithm for G G {3,4} and q G {1,2,3}. The model 
selected using the BIC was CCCU with G = 3 and q — 1, with a BIC value 
of —3837.383. Even though the BIC value observed using the classification 
approach (with three known groups membership) was very close to the BIC 
value using clustering, the sub-clusters do not have precisely the same clas- 
sification using the classification and clustering approaches. This could be a 
consequence of the classification of borderline observations among the sub- 
clusters using maximum a posteriori probability. However, the BIC value for 
the classification using three known groups was higher than the BIC value 
using two known groups, which again suggests the presence of sub-clusters. 



7 Conclusions, discussion, and future work 



In this paper, we introduced a novel family of 16 parsimonious mixture mod- 
els for model-based clustering and classification. They are linear Gaussian 
cluster-weighted models in which a latent factor structure is assumed for 
the explanatory random vector in each mixture component. The parsimo- 
ni ous versions are obtained by c ombining all of the constraints described 



in 



McNicholas and Murphvl (20081) w ith one of the constraints illustrated in 



Ingrassia. Minotti. and Punzol (120121 ) . Due to the introduction of a latent fac- 
tor structure, the parameters are linear in dimensionality as opposed to the 
traditional linear Gaussian CWM where the parameters grow quadratically; 
therefore, our approach is more suit able for modelling comple x high dimen- 



sional data. The AECM algorithm ( iMeng and van Dykl . 119971 ) was used for 



maximum likelihood estimation of the model parameters. Being based on the 
EM algorithm, it is very sensitive to the starting values due to presence of mul- 
tiple local maxima in high dimensional space. To overcome this problem, we 
proposed a 5-step hierarchical initialization procedure that utilizes the nested 
structures of the models within the family. Because these models have a hi- 
erarchical/nested structure, this initialization procedure guarantees a natural 
ranking on the likelihoods of the models in our family. In other words, the 
procedure restricts a model A, which is nested in a model B, from having a 
greater likelihood than model B. Using artificial and real data, we demon- 
strated that these models give very good clustering performance and that the 
AECM algorithms used were able to recover the parameters very well. 

With regard to the latent factor structure, for a latent dimension q > 1, the 
loading matrix A is unidentifiable because the model is still satisfied even when 
the latent factor Ui is replaced by H«j and A by AH', where H is any orthog- 
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onal matrix of order q (IMcLachlan and Peell . l2000l ). This results in an infinite 



number of possibilities for A. Even though this does not affect the clustering 
algorithm interpretation of the estimated A is not informative because AA' 
is unique. A future avenue of research is to explore further constraints on the 
factor loading matrix to ensure a uniquely defined factor loading matrix A. 

Also, while the BIC was able to identify the correct model, the choice of a 
convenient model selection criterion for these models is still an open question. 
Some future work will be devoted to the search for good model selection crite- 
ria for these models. Finally, here we assumed that the number of factors was 
the same across groups, which might be too restrictive. However, assuming 
otherwise also increases the number of models that need to be fitted, resulting 
in huge computational burden. Approaches such as variational Bayes approx- 
imations might be useful for significantly reducing the number of models that 
need to be fitted. 



A The conditional distribution of Y\x,u 



To compute the distribution of Y\x,u, we begin by recalling that if Z ~ 
N q (m, r) is a random vector with values in M. q and if Z is partitioned as 
Z = (Z' v Z' 2 )\ where Z x takes values in W 1 and Z 2 in W 2 = W~ qi , then we 
can write 



rn 



rri\ 



and T 



Tii r 12 
r 2 2 



Now, because Z has a multivariate normal distribution, Z\\Z 2 = z 2 and 
Z 2 are statistically independent with Zi\Z 2 = z 2 ~ N qi {rn\\ 2 -, and 
Z 2 ~ N q2 (m 2 , r 22 ), where 

m ll2 = m 1 + T 12 T 22 1 (z 2 -m 2 ) and T^ = Tn - r^r^^i. (A.l) 

Therefore, setting Z = (Z[, Z' 2 )' , where Z[ = Y and Z 2 = (X',U')', gives 
m i = Po + (3'ifJ. and m 2 = (/x', 0')', with the elements in T given by 



Tn =/3' 1 E/3 1 + a 2 



22 



S A 
A' /„ 



and r 



12 



It follows that Y \x, u is Gaussian with mean m y \ x u = K(Y\x,u) and variance 

9 -ir /-trl \. 1 .,1,1 

a, 



y\x,u 



Var (Y \x, u), in accordance with the formulae in f ]A.lj) . Because the 



inverse matrix of r 22 is required in (lA.ip , the following formula for the inverse 
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of a partitioned matrix is utilized: 

-i 



A B 
C D 



-1 



A — BD X C) —A l B [D — CA X B 

i /_ _ . 1 _\ -i 



-i" 



-D^C (A - BD- 1 C) (D-CA l B 

Again, writing S = A A' + we have 



1 22 



£ A 




A' I q 





* - 1 -S _1 Afl„- A'E^A 



',T, 1 



A'* 



I n - A'E _1 A 



Moreover, according to the Woodbury identity (jWoodburyl . Il950l ): 

S" 1 = (AA' + *) 1 = * 1 - * _1 A(J, + A / *~ 1 A) -1 A / * -1 . 

Now, 



f3[A 



* - 1 -S _1 A f J„ - A'S^A 



-A*" 1 (I, - A'E^A)- 1 
Finally, according to flA.lD . we have 



#0 



*nj,|a. )U = mi + r^r^ 1 



22 - rn 2 



a l\ x , u = r n - r i2r 2 2 1 r 2 i = (3['Sf3 1 + a 2 



f3[0 
A/3, 



a; — /x 
u - 

= a 2 . 



A) + to 



B Details on the AECM algorithm for the parsimonious models 



This appendix details the AECM algorithm for of all the models summarized 
in Table[TJ 



B.l Constraint on the Y variable 



In all of the models whose identifier starts with 'C, that is the models in 
which the error variance terms a 2 (of the response variable Y) are constrained 
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to be equal across groups, i.e., a 2 = a 2 for g = 1, . . . , G, the common variance 
a 2 at the (k + l)th iteration of the algorithm is computed as 

<^ i) =|££>r > {*-(flr > +rfr ) *')}'- 

n 1=1 0=1 



5.^ Constraints on the X variable 



With respect to the X variable, as explained in Section 13. 2[ we considered 
the following constraints on £ 9 = A g A' g + ^ g : i) equal loading matrices A g = 
A, ii) equal error variance *& g = and iii) isotropic assumption: *& g = 
ipglp. In such cases, the gth term of the expected complete-data log-likelihood 

Q 2 (0 2 ; 6 {k+1/2) ) , and then the estimates (EED and flU]) in Section EDS, are 
computed as follows. 



B.2.1 Isotropic assumption: ^ g = ijiglp 
In this case, Equation (fTOj) becomes 



Q 2 (A 9 , ^ 0^) = C (6[ k+1) ) + \nf^ In \^I P \ - \nf^ g hr {sf^} 

+n^ g hr{^S g k ^A g } - in(|H-iV;Hr{A a eWA 



2 

2 

yielding 



a^- 1 2 3 



W> 9 - tr {S<, fe+1) } + 2tr { 7 f +1 )A g } - tr {A fl 0f A fl }] 



Then the estimated ip g is attained for ^g, satisfying 
dQ 2 



=► p0,-tr {5( fc+1 )}+2tr{ 7 «5( fc+1 )A g }-tr {^Aj} = 0. 



Thus, according to (USD, for A g = A g = S^j'^Q- 1 we get tr {A g @ g k) A' g } 
tr{ 7 ( fc )5f +1) A 9 } and, finally, 

^ 9 = hr{S^-A g ^S^}. 

Thus, 
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^^{S^-A^Sf^} (B.l) 

with 0+ computed according to (fT5l) . 

B.2.2 Equal error variance: g = ^ 
In this case, from Equation ffTUj) . we have 

Q 2 (A g , 6^) = C(6[ k+1) ) - \nf^ In |*| - \nf +1 \r {s^*- 1 } 



yielding 



V2 V 3 ' ' Z. _ A T? (fe+i)i|/_i r? (fe+i)s( fe+1 )+r?( fc+1 W( fc+1 )V( fe )A'--r7( fc+1 )A flWV 

_ 2 3 2 9 9 9 5 5 9 2 s 9 V 

Then the estimated is obtained by satisfying 

g <9Q 2 (a 9 , *; e (fc+1/2) ) _ 

that is 

V n< fc+1 >S< fc+1 > + f^^f^'WA' -- f n f +1 'A,0«A' = 0, 

9 o 3 9 9 9 '9 9 9 9 9 9 g ' 

Z Z g=l 9=1 Z 9 =l 

which can be simplified as 

0, 



n 1 ° 



A 9=1 

G 

with ^n^ fc+1) = n. Again, according to (USD, for A 9 = A g = S g k+1) -ff J0- 1 
9=1 

we get A g S g k) A' g = A g j g k ^S g k+1) and, afterwards, 
G n n 



* = E ^ diag {S^ - A g7 W5( fc+1 )} = £ Trf +1 )diag {sf^ - k gl f S^} . 

(B.2) 



Thus, 
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* + = E < +1) diag {S( fc+1 > - Ai-YgSW} , (B.3) 



9=1 

) ,,. ■ A g U- (J 



7+ = A'„ ( A+ A'+ + * 



where 0+ is computed according to flloj) . 

B.2.3 Equal loading matrices: A g = A 

In this case, Equation ffTOj) can be written as 

Q 2 (A, 0^) = C(0[ k+1) ) + in^lnl^J 1 ! - ^ fc+1) tr {sf +1 ^} 
+ n g k+1 hr {A 7 f x } - ~n a fc+1) tr {A't^ABf} . 

yielding 

Then the estimated A is obtained by solving 

e 1 ' a : — 1 = e -r^; 1 [sps™ - a©« ] = o, ( B . 4) 

9=1 °^ 9=1 

with 7^ fc) = A' (fc) ^A (fc) A' (fc) + \I> 9 fc) ) \ In this case, the loading matrix cannot 
be solved directly and must be so lved in a row-by-row manner as suggested 



by iMcNicholas and Murphyl (120081 ). Therefore 



n. 



\ + = r* E-^eJ (B.5) 
\ 9 =l Wg{€) j 

7 + = A'(A + A' + + *+) _1 (B.6) 

®9 + ^t A+ + l + gSf +1) l7i (B- 7 ) 

where Xf is the ith row of the matrix A + , ipg{i) is the zth diagonal element of 

g _ x 

*i?g, and Ti represents the ith row of the matrix ^ n g k+1 ^ y^'g) S g k+1 \ 



9=1 



B.2-4 Further details 

A further schematization is here given without considering the constraint on 
the Y variable. Thus, with reference to the model identifier, we will only refer 
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to the last three letters. 



Models ended by UUU: no constraint is assumed. 



Models ended by UUC: ^f g = ip g I p , where the parameter ib g is updated 
according to (IB.ip . 

Models ended by UCU: ^ g = where the matrix \I/ is updated accord- 
ing to (lR3j) . 

Models ended by UCC: * 5 = ipl p . By combining flB~T|) and (lB~3j) we ob- 
tain 



i G ^(fc+1) i G 

(B.8) 

Thus, 



r= l -t< +1 ^{sr i} -KyA t+1> } 

P 9=1 

7 9 + =a; + (a 9 + a; + +^) _1 . 



with 0+ computed according to ( fTBT) . 

Models ended by CUU: A 9 = A, where the matrix A is updated accord- 
ing to (IB. 51) . In this case, SE^ is estimated directly from (|T2l) and thus 
= diag{5( fc+1) -2A + 7 9 5( fc+1) + A+0 9 A'+}, with 7+ and 0+ com- 
puted according to (1B.6j) and (IB. 7ft . respectively. 



Models ended by CUC: A g = A and ^ 9 = ^> ff J p . In this case, equa- 
tion (IB. 41) . for * g = ^ 9 J p , yields 

g sOo (a ib ■ e^ k+1 / 2 A g g 

e 1 — 1 = e ww-i "r'v^ef = o, 

9=1 aiV 9=1 9=1 



and afterwards 



/ G (fc+1) \ / G \ 
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with -yf = A' (fc) (A (fc) A' (fc) + ijjf) J p ) \ Moreover, from 



dQ 2 (A,^;0 



(fe+l/2) 



11 



dip; 1 



V 9 2 




(k+i) 



tr {S g k+ ^} - 2tr {S^+^f ] A'} + tr { AQf +1 ^A'} 



we get ^ 9 = (l/p)tr [S g k+1) - 2Ay'^S g + A0 ff A'}. Thus, 



A H 



G n {k+l) 



G n (k+l) 



-1 



1 t 



^ = ~tr {S^ - 2 A + ~f' g S g + A+0A'+} 



7 : = A' + (A + A' + + ^ + / p 



-i 



with 0+ computed according to flB.7p . 

Models ended by CCU: A g = A and * 9 = so that 7 « = A ,(/c) (A (fc) A (fc) + * 
Setting * 5 = * in ( 1B.4j) . we get 



r(*) 



g ag 2 (a, g( fc+1 / 2 ) 



G 



9=1 



<9A 



5- n (*+i)^-i £(*+i) 7 '(*)_ A e 



9=1 

* 1 

* 1 



G 



(k) 
9 

G 



y(fc) £ ^fc+Dfifffc+D _ A n (W) (*) 



9=1 

/(fe) s (k+i) _ Ae (fc) 



9=1 



where 



s (k+i) = y^fc+ij^fc+i) 

9=1 
G 

©(fc) = 7r (*+i)0(fe) _ / f; _ 7 (fc)A( fc ) + 7 ( fc ) s {k+1) -y' {k) . 

9=1 

Thus, 

A = 5 (fc+i)yW( e W)- 1 . (B9) 

Moreover, setting A g = A in (jR2]l . we get * = diag {5 (/c+1) - A 7 ( fc )£ (fc+1) }. 
Hence, 



A+ = 5 (fc+Dy -i 
* + = diag{5( fe+1 )-A+ 7 5( fc+1 )} 
7 + = A'+ (A+A'+ + *- 



(B.10) 
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with 0^ computed according to (1B.7j) . 

Models ended by CCC: A g = A and * 9 = i/>I p , so that 7W = A /(fc) (A (fc) A' (fe) + 
Here, the estimated loading matrix is again (1B.9|) . while the isotropic term 
obtained from flBj| for A g = A is $ = (l/p)tr {s {k+1) - Ay^ S {k+1) } , 

with 7 « = Af ] (A g k) A' g {k) + ^Ip)"*- Hence, 
^ + = -trls( fe+1 )-A + 7 S( fc+i n 

p L J 

7 + = A /+ (A + A'+ + ^ + J p ) _1 , 
with A + and 0^ computed according to (IB. 101) and (IB. 71) . respectively. 



C True and estimated covariance matrices of Section 16.11 



Because the loading matrices are not unique, for the simulated data of Exam- 
ples 1 and 2 we limit the attention to a comparison, for each g = 1, . . . , G, of 
true and estimated covariance matrices. 



C. 1 Example \6.1.1\ 



103.36 103.07 101.37 79.41 105.66 
103.08 119.39 110.23 85.97 115.47 

101.37 110.23 129.77 106.08 118.50 
79.41 85.97 106.08 101.46 95.21 

105.66 115.47 118.50 95.21 121.63 



Si 



107.59 114.55 110.42 87.29 114.43 
114.55 139.40 127.06 100.09 132.06 

110.42 127.06 146.31 122.92 134.12 
87.29 100.09 122.92 117.97 110.09 

114.43 132.06 134.12 110.09 135.66 



34.25 15.16 17.81 22.39 14.62 
15.16 17.01 11.42 13.98 8.95 
17.81 11.42 17.62 16.12 10.45 
22.39 13.98 16.12 28.11 13.11 
14.62 8.95 10.45 13.11 10.19 



22.16 7.44 13.71 12.89 10.12 
7.44 11.25 7.59 8.05 5.48 
13.71 7.59 18.83 13.53 10.13 
12.89 8.05 13.53 22.00 9.41 
10.12 5.48 10.13 9.41 8.63 
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C.2 Example \6JJ£ 



Si 



10.41 3.61 4.07 4.48 5.71 
3.61 7.83 2.88 3.18 4.03 
4.07 2.88 8.67 3.81 4.64 
4.48 3.18 3.81 9.61 5.17 
5.71 4.04 4.64 5.17 11.73 



8.86 3.89 5.06 3.84 5.72 
3.89 7.23 3.59 1.79 4.04 
5.06 3.59 8.44 3.85 5.50 
3.84 1.79 3.85 7.74 4.38 
5.72 4.04 5.50 4.38 9.81 



103.36 103.07 


101.37 79.41 105 


66 




106 


17 100.46 93 


18 


73.81 


105.01 


103.08 122.1 


110.23 85.97 115 


47 




100 


46 113.71 92 


97 


72.22 


107.88 


101.37 110.23 


134.33 106.08 118 


50 


±2 = 


93 


18 92.97 108 


25 


83.08 


102.36 


79.41 85.97 


106.08 102.73 95 


21 




73 


81 72.22 83 


08 


80.09 


81.85 


105.66 115.47 


118.50 95.21 129 


21 




105 


01 107.88 102 


36 


81.85 


122.59 



25.19 15.16 17.81 22.39 14.62 
15.16 10.67 11.42 13.98 8.95 
17.81 11.42 13.12 16.12 10.45 
22.39 13.98 16.12 20.31 13.11 
14.62 8.95 10.45 13.11 8.70 



32.47 


19.91 23.06 28.78 


18 


80 


19.91 


14.10 14.96 18.25 


11 


66 


23.06 


14.96 16.95 20.77 


13 


45 


28.78 


18.25 20.77 25.95 


16 


77 


18.80 


11.66 13.45 16.77 


11 


10 
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